Final Project - Explainer notebook

Motivation

The main dataset used in this project is an open-source dataset containing property registration data for the Icelandic housing market, downloaded from https://www.fasteignaskra.is/gogn/grunngogn-til-nidurhals. It contains historical purchase prices, property addresses, postal codes, size in square meters, number of rooms, type of housing, etc. To obtain geographical data we needed a supplementary dataset containing latitude and longitude values for each property. Merging the two gave the final dataset, which is 75.4 MB in size.

The idea was to explore the Icelandic housing market. It is a niche market, but as the team members are from Iceland and the housing market is always relevant, we thought it would be a great topic. The ideas in our project can also be applied to other housing market data, and we hope to give insights to potential buyers, since for most people a home is the main investment of their life.

The goal was to create a compelling and insightful story of the Icelandic housing market, looking at price changes over the years, best buys and generally fun and interesting things, e.g. which property has been bought and sold most often over the span of 18 years. The end goal is that a reader with no previous experience of the Icelandic housing market will gain insights and a high-level overview of the market as a whole.

Basic stats

To better understand the dataset, we utilized the set of tools acquired during the course. The sections Data cleaning and preprocessing and Data stats take you through it. We also include a short subsection called Other datasets where we discuss the supplementary datasets to our main dataset, where we got them and what insights they can give to our main dataset.

Data cleaning and preprocessing

We started by importing the necessary packages, reading in the datasets, merging them into a single dataset and taking a closer look at the data we had.

The dataset contains 193,005 property registration records with 52 columns, some more relevant and useful than others. By looking at the descriptions of these columns we were able to identify the most important features. Here are links to the descriptions, but they are in Icelandic so Google Translate comes in handy: https://www.fasteignaskra.is/gogn/grunngogn-til-nidurhals/kaupskra-fasteigna/eigindalysing-kaupskrar & https://www.fasteignaskra.is/library/Samnyttar-skrar-/Fyrirtaeki-stofnanir/Nidurhal/Sta%C3%B0fangaskr%C3%A1%20eigindal%C3%BDsing.pdf.

The final set of columns/features we chose are these (English translation of the column to the right of the backslash):

For better code transparency we translated the column names from Icelandic to English. We also translated the housing types from Icelandic to English and dropped all rows with houses of type "Commercial property", "Other", "Garage/Shed" or "Summer house", as they were of less interest to us and we wanted to focus on properties the "average Joe" would buy as their main property.

An interesting metric when looking into property data is the purchase price per square meter, so we added it as a new column to our data frame, where the unit of measure is thousand ISK per $m^2$.
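A minimal sketch of how such a column could be derived with pandas. The column names `purchase_price` and `size_m2` and the toy values are assumptions for illustration and may not match the notebook's actual data frame.

```python
import pandas as pd

# Toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "purchase_price": [45_000, 62_500, 30_000],  # thousand ISK
    "size_m2": [90.0, 125.0, 0.0],               # a zero size must not yield inf
})

# Treat non-positive sizes as missing so the division cannot produce inf.
valid_size = df["size_m2"].where(df["size_m2"] > 0)

# Price per square meter, in thousand ISK per m^2.
df["price_per_m2"] = df["purchase_price"] / valid_size
print(df)
```

Masking zero or negative sizes before dividing keeps a single bad registration from dominating later averages.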

We then did the common conversions: setting the date column to datetime and the year built to integer, as well as stripping whitespace from all string columns. We also scaled the price and value columns, which were given in thousand ISK, to million ISK instead, since we are dealing with large numbers that usually run in the millions.
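The conversions described above could look roughly like the following sketch. All column names and values here are hypothetical stand-ins, and the list of string columns is written out by hand for clarity.

```python
import pandas as pd

# Toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "date": ["2021-03-15", "2019-11-02"],
    "year_built": ["1998", "2005"],
    "address": ["  Laugavegur 1 ", "Hafnarstræti 5  "],
    "purchase_price": [45_000, 62_500],  # thousand ISK
})

# Parse dates and make the build year a (nullable) integer.
df["date"] = pd.to_datetime(df["date"])
df["year_built"] = pd.to_numeric(df["year_built"], errors="coerce").astype("Int64")

# Strip surrounding whitespace from the string columns.
string_columns = ["address"]
for col in string_columns:
    df[col] = df[col].str.strip()

# Rescale from thousand ISK to million ISK.
df["purchase_price"] = df["purchase_price"] / 1_000
print(df.dtypes)
```

The nullable `Int64` dtype is one way to keep the year as an integer even when some rows have no build year.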

To work with relevant property data we filtered on unenforcable_contract = 0, since we only wanted usable records. Usable means a contract/agreement that can be used in evaluating the market, i.e. a record that is believed to be correct and not biased (an unusable contract would be, for example, one between parents and children, where the property is sold for much less than the market value). We also treated the few properties in postal code 611 as outliers in this analysis, since that postal code belongs to a remote island in the north of Iceland.
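As a sketch, the two filters can be expressed as boolean masks. The flag name `unenforcable_contract` comes from the text above, while `postal_code` and the toy rows are assumptions.

```python
import pandas as pd

# Toy data; 'postal_code' and the values are illustrative assumptions.
df = pd.DataFrame({
    "unenforcable_contract": [0, 1, 0, 0],
    "postal_code": [101, 200, 611, 600],
    "purchase_price": [55.0, 12.0, 20.0, 40.0],  # million ISK
})

usable = df["unenforcable_contract"] == 0   # keep only usable contracts
not_outlier = df["postal_code"] != 611      # drop the remote-island postal code

df_clean = df[usable & not_outlier].reset_index(drop=True)
print(df_clean)
```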

Finally we decided to enrich our data with the different regions of Iceland, which are 8 in total. We used this list https://is.wikipedia.org/wiki/Listi_yfir_%C3%ADslensk_p%C3%B3stn%C3%BAmer and also domain knowledge of postal codes in Iceland to map the regions, according to the postal codes. The regions of Höfuðborgarsvæðið and Suðurnes did not follow a nice pattern so they were manually inserted, but other regions were easier to map.

Above is the cleaned and processed dataset. After the filtering and pre-processing we have 137,415 records. We will analyze the dataset further in the coming sections.

Other dataset(s)

To enrich our data we needed other supporting datasets, and this small section describes them. Some are less interesting, like the consumer price index, used to adjust the prices for inflation in the histogram plots above (the data was found through this link: https://px.hagstofa.is/pxis/pxweb/is/Efnahagur/Efnahagur__visitolur__1_vnv__1_vnv/VIS01000.px).
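One common way to adjust prices for inflation is to rescale each nominal price by the ratio of a reference CPI to the CPI in the month of sale, $p_{real} = p_{nominal} \cdot CPI_{ref} / CPI_{sale}$. The sketch below assumes this approach and uses made-up index values and hypothetical column names; the notebook's exact method may differ.

```python
import pandas as pd

# Made-up CPI values keyed by month; not real index data.
cpi = pd.DataFrame({
    "month": ["2010-01", "2020-01", "2024-01"],
    "cpi": [100.0, 150.0, 200.0],
})

# Toy sales; column names are illustrative assumptions.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-20", "2020-01-05"]),
    "purchase_price": [20.0, 45.0],  # million ISK, nominal
})

# Attach the CPI of the sale month to each contract.
sales["month"] = sales["date"].dt.strftime("%Y-%m")
sales = sales.merge(cpi, on="month", how="left", validate="many_to_one")

# Express every price in the price level of the latest CPI month.
reference_cpi = cpi["cpi"].iloc[-1]
sales["price_real"] = sales["purchase_price"] * reference_cpi / sales["cpi"]
print(sales[["date", "purchase_price", "price_real"]])
```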

Then we had two different interest rates, indexed and non-indexed. Here is a link to the rate data: https://www.sedlabanki.is/annad-efni/meginvextir-si/. We then did some minor transformation and cleaning of the interest rate data, as seen below.

Lastly we found more fun data, namely shapefile data of the different regions in Iceland, which are 8 in total. We had already mapped the regions into our dataset and then found the shapefiles for them from the following data source: https://simplemaps.com/gis/country/is#admin1. As a quick visualization we plotted a simple choropleth of the contract distribution across the regions, and to no surprise the capital region (Höfuðborgarsvæðið) dominates the housing market.

Data stats

In this section we wanted to understand in more detail how the dataset is distributed across time periods, municipalities and variables. We also wanted to understand the basic stats of the dataset as well as the geographical representation of the data.

We started by looking at the time horizon of our historical property contracts and the number of municipalities in the data. The data runs from May 2006 to March 2024 and covers 62 different municipalities and 157 different postal codes (after some pre-processing).

We then created a simple bar chart of the historical distribution of contracts per year. The plot below reveals that there was a spike in 2007 which then drops drastically in 2008, the most likely reason for this trend being the financial crisis in 2008. From 2009 up until 2021 there is an upwards trend but now in the last 2 years it seems the housing market has cooled down from the record year of 2021.
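The yearly counts behind such a bar chart can be obtained in one line with pandas. This is a sketch on toy dates, assuming a datetime column named `date`.

```python
import pandas as pd

# Toy contract dates; the column name 'date' is an assumption.
df = pd.DataFrame({"date": pd.to_datetime([
    "2007-02-01", "2007-08-15", "2008-05-03",
    "2021-06-30", "2021-07-01", "2021-12-24",
])})

# Number of contracts per calendar year, ordered chronologically.
contracts_per_year = df["date"].dt.year.value_counts().sort_index()
print(contracts_per_year)
```

Calling `contracts_per_year.plot.bar()` would then draw the chart, provided matplotlib is installed.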

Then we plotted a bar chart showing how many historical property contracts have been made in each municipality.

No surprise that Reykjavíkurborg, the capital of Iceland, has by far the greatest number of contracts, and smaller towns in the countryside have very few in comparison. By looking at the graph, and applying our domain knowledge as "experts" from Iceland, we chose our "focus" municipalities to be the top 16 by number of contracts. Most of them are in the capital region but we also have big towns in the eastern and northern parts of Iceland.
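Selecting the top municipalities by contract count could be sketched like this. The toy data uses `TOP_N = 3` instead of 16, and the column name `municipality` is an assumption.

```python
import pandas as pd

# Toy data; one row per contract.
df = pd.DataFrame({"municipality": (
    ["Reykjavíkurborg"] * 5 + ["Kópavogsbær"] * 3
    + ["Akureyrarbær"] * 2 + ["Múlaþing"]
)})

TOP_N = 3  # the project uses 16; 3 keeps the toy example small

# value_counts sorts by frequency, so the head holds the busiest municipalities.
counts = df["municipality"].value_counts()
focus = counts.head(TOP_N).index.tolist()

# Keep only contracts from the focus municipalities.
df_focus = df[df["municipality"].isin(focus)]
print(focus)
```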

To further understand the distribution of our dataset we decided to plot a bar chart showing the number of purchase agreements within our focus municipalities.

The number of contracts per municipality plot shows the distribution of our focus municipalities in greater detail. The top 3 are in the capital region, in 4th place we have Akureyrarbær, the largest town in the north of Iceland, and in 5th place Reykjanesbær, located on the Southern Peninsula.

It also made sense to plot the distribution of historical housing prices using a histogram, and to calculate the inflation-adjusted price using the consumer price index (CPI) from Statistics Iceland.

Hmm...the above plot is not that nice to look at, but there seem to be some absurdly expensive properties, which are maybe not outliers in a strict sense but are nevertheless out of the ordinary. If we only look at purchase prices below 200 mISK we should get a better feeling for the distribution of the purchase prices.

Well, yeah, of course there are always some properties that go way above the market trend...but hey, here we have a fairly nice Gaussian-like bell, albeit a little skewed and with a long tail made up of the more expensive properties.

It also makes sense to look at a scatter plot of square meters against purchase price (adjusted for inflation). We would guess that the purchase price increases with size, but it may depend heavily on where in Iceland the property is located, so we do not expect a linear relationship for the whole dataset.

Yes, there are some upward trends, but the points are heavily clustered and there is not much insight to gain, other than that the purchase price can differ a lot between properties of about the same size. Only the biggest ones, past 350 square meters, seem to follow a clearer upward trend for the whole dataset. This pointed us towards exploring whether this holds for all locations (regions/postal codes) or whether we could differentiate by looking at specific locations.

Yes, much better, and clearly, the Capital Region (Höfuðborgarsvæðið) is much more expensive than the other regions.

Since we have latitude and longitude data for all registered properties, we wanted to visualize which regions in Iceland have higher density. We did that by drawing a dot for each contract, with the home address as a popup. We arbitrarily chose to display all contracts from 2021, just to get a high-level overview.

The plot shows similar insights as before: the capital region is the most densely populated, and properties in the eastern parts are more sparsely distributed than in the northern parts.

Data Analysis

The Data Analysis section aims to describe our data analysis and explain the insights found via the data exploration we conducted.

Bar charts - Trying to detect fundamental patterns

Here we started out by focusing in more detail on each municipality, plotting the number of contracts for each municipality per year and per month to see if we could find any municipality-specific patterns.

From the yearly plot we notice that most of the municipalities follow the same trend, with peaks in 2007 and 2021 and downward-to-upward trends in between. However, Vestmannaeyjabær sticks out as it doesn't have sharp and distinct patterns. A possible reason is that it is an island where the local people rarely move, as they and their families have lived there for a long time and can't really think about moving away.

From the monthly plot it seems there is a slight seasonality during the summer and later in the year, as we observe an upward trend in the number of contracts from the beginning of the year up until October or November. No municipality differs from the others, and overall the number of contracts seems to be about the same on average between months.

Focusing on certain municipalities didn't result in interesting and new patterns, not already seen in the initial bar chart in the Data Stats section. We thus decided to use monthly average mortgage rates (indexed and non-indexed) and plot them against a monthly bar chart of number of contracts. This resulted in the following graph:

From the graph we see how the rates rose from 2007/2008 to better control the economy, stayed high until late 2013 and then remained stable until they began to fall a little in 2022. The number of contracts moves inversely to the rates: when the rates go up, the number of purchase agreements goes down. We can see a peak in 2021, which is probably an indirect result of COVID-19, as it likely fueled an increase in savings during the lockdowns caused by the pandemic.

Turning back to investigating trends in the focus municipalities, we calculated the average purchase price, size in square meters and number of rooms of the properties, as well as the most frequent year built, per municipality. This was done to gain insights into the characteristics of each focus municipality in terms of these metrics.

From the above graph we observe that the most expensive municipalities are Garðabær and Seltjarnarnesbær, while the least expensive are Fjarðabyggð and Múlaþing in the eastern parts of Iceland, as well as Ísafjarðarbær in the Westfjords. The oldest houses are in Reykjavík, no real surprise there, but perhaps more interesting is that in most other municipalities most properties were built from 2000 onwards, indicating that many new properties have been built in the last 20 years. We see that the most expensive municipalities, Garðabær and Seltjarnarnesbær, have the biggest properties on average, but it is Fjarðabyggð that has the most rooms, 3.5 on average.

Price trend - interactive plot with bokeh

We then decided to look at price changes in our focus municipalities. A time series plot was chosen for the analysis, where we used a floating window to calculate a rolling average of the historical purchase prices, smoothing the data series while still capturing the underlying trend.
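A rolling average per municipality could be sketched as follows. The window length, the monthly aggregation step and the column names are assumptions, since the notebook's exact settings are not stated here.

```python
import pandas as pd

# Toy contracts; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-10", "2020-01-20", "2020-02-05", "2020-03-15", "2020-04-01",
    ]),
    "municipality": ["Múlaþing"] * 5,
    "purchase_price": [30.0, 40.0, 50.0, 20.0, 60.0],  # million ISK
})

WINDOW = 3  # window length in months (assumed)

# Step 1: average price per municipality and calendar month.
df["month"] = df["date"].dt.to_period("M")
monthly = df.groupby(["municipality", "month"])["purchase_price"].mean()

# Step 2: rolling mean within each municipality; min_periods=1 keeps the first months.
trend = monthly.groupby(level="municipality").transform(
    lambda s: s.rolling(WINDOW, min_periods=1).mean()
)
print(trend)
```

Note that in this sketch months without any sales are simply absent, so the window spans the last `WINDOW` observed months rather than calendar months. A larger window gives a smoother but slower-reacting curve.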

An interactive bokeh plot was used to let users toggle the desired municipalities on and off and hopefully derive their own insights.

From the graph above we see that the different municipalities follow the same trend in general, although in recent years the trends diverge more. In Múlaþing, for example, prices have risen quite rapidly over the last 2-3 years, perhaps indicating higher demand in the eastern parts of Iceland.

For a new perspective, we also calculated and plotted the price trends of the different property types: Apartment, Plexi/semi house and Private house.

Interestingly, we see that the price trends of the three property types follow nearly the same path through the years and seem to be highly positively correlated.

Maps of Iceland

This section is dedicated to the data analysis and exploration done using the latitude and longitude data to get a geographical representation of the data, hopefully revealing new insights not possible with simple bar charts or time series.

Cheap vs expensive regions/areas

We further analysed the data to get a better understanding of cheap and expensive regions and/or areas in Iceland. To begin with we drew a choropleth of the 8 regions in Iceland and calculated the average purchase price and average purchase price per $m^2$ of each region, color coding the plot in red: the more expensive the region, the redder it is. The resulting plots are seen below, and interestingly there is no difference between looking at the average price and the price per $m^2$. Following this we decided to 'zoom' in on the capital region and try to draw out some price trends between the areas of the capital region.
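The per-region averages that feed a choropleth like this can be computed with a single grouped aggregation. The sketch below uses toy values and hypothetical column and region names, and also checks whether the two metrics rank the regions identically.

```python
import pandas as pd

# Toy data; names and values are illustrative assumptions.
df = pd.DataFrame({
    "region": ["Capital", "Capital", "North", "North", "East"],
    "purchase_price": [60.0, 80.0, 40.0, 30.0, 25.0],      # million ISK
    "price_per_m2": [600.0, 700.0, 350.0, 300.0, 250.0],   # thousand ISK per m^2
})

# One row per region with both averages and the number of contracts.
region_stats = (
    df.groupby("region")
      .agg(
          avg_price=("purchase_price", "mean"),
          avg_price_per_m2=("price_per_m2", "mean"),
          n_contracts=("purchase_price", "size"),
      )
      .sort_values("avg_price", ascending=False)
)

# Do both metrics order the regions the same way?
same_ranking = bool(
    (region_stats["avg_price"].rank() == region_stats["avg_price_per_m2"].rank()).all()
)
print(region_stats)
print("Same ranking:", same_ranking)
```

The resulting table can then be joined to the region shapefile on the region name before plotting.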

In the two plots below we zoom in as we wanted to see what areas in the capital region, defined by postal codes, are the most expensive or the cheapest. We used color codes to visualize cheap areas to expensive areas where blue is cheap and red is expensive, and to also capture the size of the areas, measured in number of contracts, we used circles where the smaller ones represent smaller neighbourhoods and the larger circles represent bigger areas having larger number of houses. We also made use of a hovering tooltip with extra info about the regions, such as the postal code, the municipality, the price and the number of contracts used to get the average and a measurement of how dense the population is within each postal code.

The first plot shows the average purchase price of properties, meaning it shows in what areas on average it is the cheapest/most expensive to buy a property. The second plot shows the average purchase price per square meter, meaning it shows where the "best buy" properties are if you want to get the most for your buck...or where you can get the most expensive square meters.

Reading from the plot, two areas catch the eye by being fire red. Those are postal codes 102 (Reykjavík, Skerjafjörður) and 210 (Garðabær), with average purchase prices of 67.34 and 62.76 million ISK respectively. The cheapest postal code area is 111 (Reykjavík, Breiðholt) with an average purchase price of 31.16 million ISK. Looking at the big picture, Reykjavíkurborg all in all seems to have the cheapest areas.

The second figure, displaying areas according to the average purchase price per $m^2$, can be seen below:

When looking at the average price per $m^2$ a different pattern appears. Now the cheapest postal code area is 109, with 111 a close second (both in Reykjavík, Breiðholt), with average prices per $m^2$ of 332 and 335 thousand ISK respectively. The most expensive area is still 102 (Reykjavík, Skerjafjörður) with a whopping 629 thousand ISK per $m^2$ on average. The second most expensive area is 103 (Reykjavík, Háaleitis- og Bústaðahverfi) with an average price per $m^2$ of 515 thousand ISK, which is quite a difference. Taking everything in, it seems that more suburban areas are cheaper and that the more centrally located you are in Reykjavík, the higher the square meter price is.

The most isolated houses in Iceland (machine learning)

We wanted to explore different but fun insights from the housing market dataset, and finding the most isolated houses was an interesting one. We could have computed a complete distance matrix between all properties in Iceland, but decided not to, as it is computationally very heavy. Instead we went with the BallTree algorithm from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html), which organizes the points (lat and lon) in a space-partitioning data structure that allows for quick nearest-neighbor queries (reference: https://www.geeksforgeeks.org/ball-tree-and-kd-tree-algorithms/). We thought it was a perfect fit, as it is quick yet gives nice insights for the geographic analysis we wanted to perform. We again used the hovering tooltip to feed extra info into the graph, so users can hover over a house and see, among other things, when it was built and its isolation score.
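A minimal sketch of a nearest-neighbor isolation score with `BallTree`, assuming the score is the great-circle distance to the closest other property. The coordinates below are rough, illustrative points, and using the haversine metric is an assumption about the setup rather than a statement of what the notebook does.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

# Illustrative (lat, lon) points in degrees: two close together,
# one some tens of km away, and one far from all the others.
coords_deg = np.array([
    [64.1466, -21.9426],
    [64.1355, -21.8954],
    [63.9330, -20.9970],
    [65.2669, -14.3948],
])

# The haversine metric expects [lat, lon] in radians.
coords_rad = np.radians(coords_deg)
tree = BallTree(coords_rad, metric="haversine")

# k=2 because the closest hit for each point is the point itself (distance 0).
dist_rad, idx = tree.query(coords_rad, k=2)
nearest_km = dist_rad[:, 1] * EARTH_RADIUS_KM   # isolation score in km
nearest_idx = idx[:, 1]                         # index of the closest other point

most_isolated = int(np.argmax(nearest_km))
print(most_isolated, nearest_km.round(1))
```

If several records share identical coordinates (for example apartments in one building), their nearest-neighbor distance is 0, so deduplicating coordinates before building the tree is probably needed for a meaningful score.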

The most isolated house in Iceland is Möðrudalur, a private house built in 2007. Doing a little bit of digging we found out that this is actually a farm settlement and the highest inhabited place in all of Iceland, at 469 m above sea level. The closest neighbour of Möðrudalur is the property Klaustursel lóð 1, which is also a private house, built in 1956. The distance between the two is approximately 43 km (straight line). The second most isolated location is Hólssel lóð, whose closest neighbour is around 32 km away, and the third most isolated is Reynivellir Ekra, a new house built in 2019 that is around 26 km from its nearest neighbour.

Best buy over the years?

We calculated the "best buy" for each property by looking only at the difference between purchase_price and current_property_value from the time the property was last bought, expressed as a percentage, value_change, which indicates the percentage change in the value of the property. We could have also looked at the purchase price, but in the grand scheme of things that value is not important, as most property purchases in Iceland are influenced by the property value, often landing only a couple of million kr. above or below it. This is not the case for every property, but for this analysis we made that simplification.

Focusing on the capital region

House size groups

Zooming back in on the capital region, we wanted to see if any patterns were related to the location of the smallest houses (houses equal to or smaller than 25 $m^2$) and the bigger mansion like properties of size 500+ $m^2$. We colored the smaller houses blue and the bigger red and used the size as a factor into the radius to better visualize how big the biggest houses are. We again used a tooltip to enable hovering over the circles and gain more details of the property. We also decided to add in a layer which makes it possible to toggle in and out the three different property types, if one is keen to only look at a certain type.

We see that the further we are from the capital region the bigger the houses get, and one sticks out as being the largest. That is the property at Eskivellir 13, which is 2740 $m^2$ in size. The more central one is in Reykjavík the smaller the houses tend to get, and to no surprise most of the smallest are apartments. The smallest private house is however located at Njálsgata 60B, coming in at 20 $m^2$, which is rather small but still has 2 rooms.

Oldest houses in Iceland

We ended by exploring where the oldest houses in Iceland are located, using circle markers with a color scale indicating how old each house is. We chose to display the 30 oldest houses in Iceland, and further information can be gathered by hovering over the markers.
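Picking the oldest houses is a simple top-N selection on the build year. In this sketch the addresses are placeholders, `N_OLDEST` is set to 2 instead of 30, and the column names are assumptions.

```python
import pandas as pd

# Toy data with placeholder addresses; one row has no known build year.
df = pd.DataFrame({
    "address": ["House A", "House B", "House C", "House D"],
    "year_built": [1880, 1995, 1824, None],
})

N_OLDEST = 2  # the project displays 30

# Drop rows without a build year, then take the N smallest years.
oldest = df.dropna(subset=["year_built"]).nsmallest(N_OLDEST, "year_built")
print(oldest)
```

If the same house appears in several contracts, deduplicating on a property identifier first would avoid showing it more than once.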

We see that the top 30 oldest houses in Iceland, still inhabitable, range from being built 1824 to 1880. From searching around the map we find that the oldest house is located in Akureyri at address Lækjargata 2A.

Genre

The genre we chose for our data visualization story of the Icelandic housing market was the Magazine style, with some Annotated chart elements mixed in to give a bit more information per plot. We wanted the story to be mostly author-driven, which fits perfectly with the magazine style. It allowed for guided storytelling about historical trends in the Icelandic housing market, as well as the geographical factors that influence the market. We wanted to make somewhat complex data/information accessible and more engaging to a wider audience. We still wanted to find a balance between the author-driven and the reader-driven approach, and thus decided to include interactivity, allowing readers to make their own discoveries, focusing on things that were not only informative but also fun!

For Visual Narrative we chose the following tools for each of the 3 categories:

  1. Visual Structuring
    • We chose the tool Consistent visual platform as a way to have consistency between different figures, such as having the same colour scheme, the same font and font size etc. The goal of keeping the visual platform consistent was to make it easier for the reader to navigate through our data story without wasting time understanding new figures each time. Other tools were not used/relevant for our storytelling.
  2. Highlighting
    • We used Feature distinction in our geographical data representation by using cool-warm color bars to highlight differences between regions, different types of houses or different elements of the data. We also used different sized circles to visually show quantitative differences, making it easier for the reader to quickly grasp the difference we wanted to communicate. We also used the Zooming tool, which is great as it enables readers to look closer at areas that interest them and makes it easier to gain more detailed insights into the data story.
  3. Transition Guidance
    • We didn't use any particular transition guidance as it was linear/magazine style, meaning the reader would just need to scroll down or up to transition from one part to the other.

For Narrative Structure we chose the following tools for each of the 3 categories:

  1. Ordering
    • We did a mix of Linear and Random Access, where the first part should be read linearly but after that it is more about getting the reader engaged, so it does not need to be read in a specific order, although the narrative was quite linear as a whole.
  2. Interactivity
    • We used Hover Highlighting and Filtering/Selection as they were the most appropriate for the story. They limit the user to simple engagement that does not need explicit instructions.
  3. Messaging
    • As it was magazine styled, we appropriately used Introductory Text and Captions/Headlines to give the reader a sense of direction through our webpage.

Visualizations

We chose our visualizations based on the agenda "The Icelandic Housing market". Since it is a foreign market for most readers, we wanted to give a high-level view of it to begin with. That meant choosing visualizations that give the reader a good overview of the topic and the dataset, like historical trends in the number of purchase agreements and price trends. Once an initial understanding was established, we wanted to go into more depth, discovering geographical trends and giving readers a more personal experience where they could zoom in on the things they found most interesting, as well as gaining knowledge of where they could possibly see themselves living in Iceland. Finally, wanting to leave a smile (hopefully) on the reader's face, we explored some more trivial facts about the housing market, as we wanted to keep the reader engaged throughout the read. After all, not everyone is wildly fascinated by the housing market and by looking at price trends and best buys.

Discussion

We went into the dataset quite blind: we had never explored it before and it was something we found on our own (not coming from Sune, meaning that to the best of our knowledge it had not been "tested" in this course before). Given that, we thought it went really well. We had initial problems with the main dataset as it lacked geographical coordinates, but thankfully we found a supporting dataset with that information, and joining the two on a relevant key gave us what we needed. Creating interesting visuals also went well, utilizing the knowledge from the course.

What we think is missing in our creation is more interactivity. We could improve on it by having "fewer" graphs but with the ability to switch between, let's say, average price and average $m^2$ price. Fine-tuning the relevant visualizations would also be needed, as we often have similar visualizations throughout the project, and more variety would improve the experience. This could be due to the limitations of the data or to us, the authors, lacking experience in judging what is the most relevant information to display for this type of data.

Contributions

The two authors of the data story and the explainer notebook are s232411 and s232738. All parts of the final project were discussed together and decisions taken mutually and all text/code that was written by one member was reviewed by the other. That being said, there were key differences in the main contributions, which are the following: